Poster: zwol Date: Dec 4, 2015 9:37am
Forum: web Subject: What is the algorithm for deciding when to not crawl a page anymore?

How does the crawler decide that a page, or a whole website, no longer exists and should be dropped from the list of things to be crawled? For instance, if the DNS for a server hostname stops resolving, presumably the crawler will _eventually_ decide that the site is gone forever, but it (also presumably) doesn't do that the very first time a DNS lookup fails.
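To make the question concrete, here is a purely hypothetical sketch (Python, with thresholds and names I invented) of the sort of failure-counting policy I imagine; the real algorithm is exactly what I'd like to know.

# Purely hypothetical sketch -- NOT a description of the Archive's actual algorithm.
import time
from dataclasses import dataclass, field

MAX_CONSECUTIVE_FAILURES = 5        # invented threshold
RETRY_BACKOFF_SECONDS = 24 * 3600   # invented: wait a day before retrying

@dataclass
class SiteState:
    hostname: str
    dns_failure_times: list = field(default_factory=list)  # timestamps of consecutive DNS failures

def crawl_decision(site: SiteState, now: float = None) -> str:
    """Decide whether to crawl, defer, or drop a site, based only on
    how many consecutive DNS failures it has accumulated."""
    now = time.time() if now is None else now
    if len(site.dns_failure_times) >= MAX_CONSECUTIVE_FAILURES:
        return "drop"    # give up on the site entirely
    if site.dns_failure_times and now - site.dns_failure_times[-1] < RETRY_BACKOFF_SECONDS:
        return "defer"   # failed recently; try again later
    return "crawl"

def record_dns_result(site: SiteState, resolved: bool, now: float = None) -> None:
    """Reset the failure streak on success; extend it on failure."""
    now = time.time() if now is None else now
    if resolved:
        site.dns_failure_times.clear()
    else:
        site.dns_failure_times.append(now)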

I am looking for a full, concrete description of the algorithm; a reference to actual code would be ideal.